Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote:
Jan Kiszka wrote:
Am 01.11.2010 17:55, Anders Blomdell wrote:
Jan Kiszka wrote:
Am 28.10.2010 11:34, Anders Blomdell wrote:
Jan Kiszka wrote:
Am 28.10.2010 09:34, Anders Blomdell wrote:
Anders Blomdell wrote:
Anders Blomdell wrote:

Hi, I'm trying to use rt_eepro100 for sending raw ethernet packets, but I'm
occasionally experiencing weird behaviour. Versions of things:
linux-2.6.34.5, xenomai-2.5.5.2, rtnet-39f7fcf.

The test program runs on two computers with an Intel Corporation
82557/8/9/0/1 Ethernet Pro 100 (rev 08) controller, where one computer acts
as a mirror, sending back packets received from the ethernet (only those two
computers are on the network), and the other sends packets and measures
round-trip time. Most packets come back in approximately 100 us, but
occasionally the reception times out (once in about 10 packets or more), yet
the packet gets received immediately when reception is retried, which might
indicate a race between rt_dev_recvmsg and the interrupt, but I might be
missing something obvious.

Changing one of the ethernet cards to an Intel Corporation 82541PI Gigabit
Ethernet Controller (rev 05), while keeping everything else constant,
changes the behaviour somewhat; after receiving a few 10 packets, reception
stops entirely (-EAGAIN is returned), while transmission proceeds as it
should (and the mirror returns packets). Any suggestions on what to try?

Since the problem disappears with 'maxcpus=1', I suspect I have an SMP issue
(the machine is a Core2 Quad), so I'll move to xenomai-core. (The original
message can be found at
http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se )

Xenomai-core gurus: what is the correct way to debug SMP issues? Can I run
the I-pipe tracer and expect to be able to save at least 150 us of traces
for all cpus? Any hints/suggestions/insights are welcome...

The i-pipe tracer unfortunately only saves traces for the CPU that triggered
the freeze. To have a full picture, you may want to try my ftrace port I
posted recently for 2.6.35.

2.6.35.7?

Exactly.

Finally managed to get ftrace to work (one possible bug: I had to manually
copy include/xenomai/trace/xn_nucleus.h to
include/xenomai/trace/events/xn_nucleus.h), and it looks like it can be very
useful... But I don't think it will give much info at the moment, since no
xenomai/ipipe interrupt activity shows up, and adding that is far above my
league :-(

You could use the function tracer, provided you are able to stop the trace
quickly enough on error.

My current theory is that the problem occurs when something like this takes
place:

    CPU-i            CPU-j       CPU-k            CPU-l
    rt_dev_sendmsg   xmit_irq
                                 rt_dev_recvmsg   recv_irq

Can't follow. What races here, and what will go wrong then?

That's the good question. Find attached:
1. .config (so you can check for stupid mistakes)
2. console log
3. latest version of the test program
4. tail of the ftrace dump

These are the xenomai tasks running when the test program is active:

    CPU  PID    CLASS  PRI  TIMEOUT  TIMEBASE  STAT  NAME
    0    0      idle   -1   -        master    R     ROOT/0
    1    0      idle   -1   -        master    R     ROOT/1
    2    0      idle   -1   -        master    R     ROOT/2
    3    0      idle   -1   -        master    R     ROOT/3
    0    0      rt     98   -        master    W     rtnet-stack
    0    0      rt     0    -        master    W     rtnet-rtpc
    0    29901  rt     50   -        master          raw_test
    0    29906  rt     0    -        master    X     reporter

The lines of interest from the trace are probably:

    [003] 2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00 thread_name=rtnet-stack mask=2
    [003] 2061.347862: xn_nucleus_sched: status=200
    [000] 2061.347866: xn_nucleus_sched_remote: status=0

since this is the only place where a packet gets delayed, and the only place
in the trace where sched_remote reports a status=0.

Since the cpu that has rtnet-stack, and hence should be resumed, is doing
heavy I/O at the time of the fault: could it be that
send_ipi/schedule_handler needs barriers to make sure that decisions are
made on the right status?

/Anders
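A minimal sketch of the failure mode being suggested here (hypothetical
code, not the actual Xenomai implementation; only the setbits /
xnarch_send_ipi / xnarch_memory_barrier names are taken from the thread):

/*
 * Suspected interleaving behind the status=0 trace entry:
 *
 *   CPU-3 (resumes rtnet-stack)           CPU-0 (runs rtnet-stack)
 *   ---------------------------           ------------------------
 *   setbits(sched0->status, XNRESCHED);
 *   xnarch_send_ipi(mask);        ---->   xnpod_schedule_handler()
 *                                         reads sched0->status == 0
 *
 * If the XNRESCHED store can still sit in CPU-3's store buffer when the
 * IPI is taken on CPU-0, the handler decides on a stale status. A write
 * barrier between the store and the IPI would rule that out:
 */
static inline void resched_remote(struct xnsched *sched,
				  xnarch_cpumask_t mask)  /* hypothetical */
{
	setbits(sched->status, XNRESCHED);
	xnarch_memory_barrier();  /* make the store globally visible... */
	xnarch_send_ipi(mask);    /* ...before kicking the remote CPU */
}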
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 12:44, Anders Blomdell wrote:
> [...]
>
> Since the cpu that has rtnet-stack, and hence should be resumed, is doing
> heavy I/O at the time of the fault: could it be that
> send_ipi/schedule_handler needs barriers to make sure that decisions are
> made on the right status?

That was my first idea as well - but we should run all relevant code under
nklock here. But please correct me if I miss something.

Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 12:50, Jan Kiszka wrote:
> [...]
>
> That was my first idea as well - but we should run all relevant code
> under nklock here. But please correct me if I miss something.

Mmmh -- not everything. The inlined XNRESCHED entry test in xnpod_schedule
runs outside nklock. But doesn't releasing nklock imply a memory write
barrier? Let me meditate...

Jan
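For reference, the unlocked entry test Jan refers to looks roughly like this
(paraphrased sketch of the nucleus/pod.h fast path, slightly simplified):

static inline void xnpod_schedule(void)
{
	struct xnsched *sched = xnpod_current_sched();

	/* Fast path: peek at the status word without taking nklock. */
	if (testbits(sched->status,
		     XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
		return;

	__xnpod_schedule(sched);  /* slow path, takes nklock itself */
}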
Re: [Xenomai-core] Potential problem with rt_eepro100
On 2010-11-03 12.55, Jan Kiszka wrote:
> [...]
>
>> That was my first idea as well - but we should run all relevant code
>> under nklock here. But please correct me if I miss something.

Wouldn't we need a write-barrier before the send_ipi regardless of what
locks we hold? Otherwise there are no guarantees that the memory write
reaches the target cpu before the interrupt does.

> Mmmh -- not everything. The inlined XNRESCHED entry test in xnpod_schedule
> runs outside nklock. But doesn't releasing nklock imply a memory write
> barrier? Let me meditate...

Wouldn't we need a read barrier then (but maybe the irq-handling takes care
of that, not familiar with the code yet)?
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 13:07, Anders Blomdell wrote:
> [...]
>
> Wouldn't we need a write-barrier before the send_ipi regardless of what
> locks we hold? Otherwise there are no guarantees that the memory write
> reaches the target cpu before the interrupt does.

Yeah, the problem is that if xnpod_resume_thread and the next
xnpod_reschedule are under the same nklock, we won't issue the barrier as
we [...]
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> [...] additional barrier. Can you check this?
>
> diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
> index df56417..66b52ad 100644
> --- a/include/nucleus/sched.h
> +++ b/include/nucleus/sched.h
> @@ -187,6 +187,7 @@ static inline int xnsched_self_resched_p(struct xnsched *sched)
>  	if (current_sched != (__sched__)) {				\
>  		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \
>  		setbits((__sched__)->status, XNRESCHED);		\
> +		xnarch_memory_barrier();				\
>  	}								\
>  } while (0)

In progress; if nothing breaks before, I'll report status tomorrow morning.

>>> Mmmh -- not everything. The inlined XNRESCHED entry test in
>>> xnpod_schedule runs outside nklock. But doesn't releasing nklock imply
>>> a memory write barrier? Let me meditate...
>>
>> Wouldn't we need a read barrier then (but maybe the irq-handling takes
>> care of that, not familiar with the code yet)?
>
> A read barrier is not required here as we do not need to order load
> operations w.r.t. each other in the reschedule IRQ handler.

Only if taking the interrupt is equivalent to:

    read interrupt status
    memory_read_barrier
    execute handler

The processor manuals should have the answer to this (or it might already
be in the code)...

> You can always help: there is a lot of boring^Winteresting tracepoint
> conversion waiting in Xenomai, see the few already converted nucleus
> tracepoints.

As soon as I have my system running, I'll put some effort into this.

/Anders
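The pairing rule under discussion, sketched with generic Linux barrier
primitives (smp_wmb/smp_rmb; producer, consumer and do_resched are
placeholder names, not Xenomai API):

/* Producer side - the CPU requesting the reschedule: */
void producer(struct xnsched *remote, struct xnsched *local)
{
	setbits(remote->status, XNRESCHED);  /* 1. publish the flag   */
	smp_wmb();                           /* 2. order the store... */
	xnarch_send_ipi(local->resched);     /* 3. ...before the kick */
}

/* Consumer side - the reschedule IPI handler on the remote CPU: */
void consumer(struct xnsched *sched)
{
	/*
	 * A matching smp_rmb() would only be needed if taking the
	 * interrupt does not already order this load after the
	 * producer's store - exactly the open question about the
	 * processor manuals above.
	 */
	if (testbits(sched->status, XNRESCHED))
		do_resched(sched);
}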
Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote:
> Jan Kiszka wrote:
>> [...] additional barrier. Can you check this?
>> [patch trimmed]
>
> In progress; if nothing breaks before, I'll report status tomorrow
> morning.

It still breaks (in approximately the same way). I'm currently putting a
barrier in the other macro doing a RESCHED, and also adding some tracing to
see if a read barrier is needed.

Interesting side note: harddisk accesses seem to get really slow after the
error has occurred (kernel installs progress with 2-3 modules installed per
second), while lots of idle time is reported on all cpus. Weird...

/Anders
Re: [Xenomai-core] Potential problem with rt_eepro100
Anders Blomdell wrote:
> [...]
>
> It still breaks (in approximately the same way). I'm currently putting a
> barrier in the other macro doing a RESCHED, and also adding some tracing
> to see if a read barrier is needed.

Nope, no luck there either. Will start interesting tracepoint
adding/conversion :-(

Any reason why xn_nucleus_sched_remote should ever report status = 0?

/Anders
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 17:46, Anders Blomdell wrote:
> [...]
>
> Nope, no luck there either. Will start interesting tracepoint
> adding/conversion :-(

Strange. But it was too easy anyway...

> Any reason why xn_nucleus_sched_remote should ever report status = 0?

Really don't know yet. You could trigger on this state and call
ftrace_stop() then. Provided you had the function tracer enabled, that
should give a nice picture of what happened before.

Jan
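One way to wire up such a trigger (a sketch; tracing_off() is the stock
kernel call that freezes the ftrace ring buffer, and the trigger condition
here is hypothetical):

/* In the code path where the suspicious state is observed: */
if (unlikely(status == 0)) {
	tracing_off();  /* freeze the ring buffer for post-mortem reading */
	printk(KERN_ERR "xn_nucleus_sched_remote: status=0, trace frozen\n");
}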
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> [...]
>
> Really don't know yet. You could trigger on this state and call
> ftrace_stop() then. Provided you had the function tracer enabled, that
> should give a nice picture of what happened before.

Isn't there a race between these two (still waiting for compilation to be
finished)?

static inline int __xnpod_test_resched(struct xnsched *sched)
{
	int resched = testbits(sched->status, XNRESCHED);
#ifdef CONFIG_SMP
	/* Send resched IPI to remote CPU(s). */
	if (unlikely(xnsched_resched_p(sched))) {
		xnarch_send_ipi(sched->resched);
		xnarch_cpus_clear(sched->resched);
	}
#endif
	clrbits(sched->status, XNRESCHED);
	return resched;
}

#define xnsched_set_resched(__sched__) do {				\
	xnsched_t *current_sched = xnpod_current_sched();		\
	setbits(current_sched->status, XNRESCHED);			\
	if (current_sched != (__sched__)) {				\
		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \
		setbits((__sched__)->status, XNRESCHED);		\
		xnarch_memory_barrier();				\
	}								\
} while (0)

I would suggest (if I have got all the macros right):

static inline int __xnpod_test_resched(struct xnsched *sched)
{
	int resched = testbits(sched->status, XNRESCHED);

	if (unlikely(resched)) {
#ifdef CONFIG_SMP
		/* Send resched IPI to remote CPU(s). */
		xnarch_send_ipi(sched->resched);
		xnarch_cpus_clear(sched->resched);
#endif
		clrbits(sched->status, XNRESCHED);
	}
	return resched;
}

/Anders
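The window being suggested, spelled out (sketch; this interleaving is only
possible if one of the two paths could run without nklock held):

/*
 * CPU-a: __xnpod_test_resched(sched_a)   CPU-b: xnsched_set_resched(sched_a)
 *
 * resched = testbits(..., XNRESCHED);  -> reads 0
 *                                         setbits(..., XNRESCHED);
 *                                         xnarch_cpu_set(a, ...->resched);
 * clrbits(..., XNRESCHED);             -> wipes the fresh resched request
 *
 * The reply below points out that both paths run under nklock, which
 * rules this interleaving out as long as every writer really takes it.
 */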
Re: [Xenomai-core] Potential problem with rt_eepro100
On Wed, 2010-11-03 at 20:38 +0100, Anders Blomdell wrote:
> [...]
>
> Isn't there a race between these two (still waiting for compilation to
> be finished)?

We always hold the nklock in both contexts.

> [code trimmed]

-- 
Philippe.
[Xenomai-core] Is anybody using the pSOS skin in userland?
Hello,

we are investigating the usage of the pSOS+ skin to port a large legacy pSOS
application to Linux. The application model consists of several processes in
which the application lives. All processes will make use of the pSOS
library.

After playing around with the library for some time we have observed several
missing service calls, bugs and differences in behaviour compared to a real
pSOS implementation:

- missing sm_ident
- missing t_getreg / t_setreg in userland (patch already included in 2.5.5)
- not possible to use the skin from the context of different processes
  (patch already included in 2.5.5)
- added support for identical task/queue/semaphore/region names by making
  names unique
- strange behaviour in the pSOS message queue (see the post "Possible memory
  leak in psos skin message queue handling")

I can (and will) deliver patches for all issues I have found, but I'm
wondering whether there are other people using the pSOS skin (in userland)
in a real-life application. The target for my project would be an embedded
system with strong reliability requirements (very stable / long running,
etc.).

Any feedback is welcome and appreciated. It is not clear to me either which
tests are executed before a new version is released. Is there any test-suite
available for the pSOS skin?

Best regards,
Ronny
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 21:41, Philippe Gerum wrote:
> [...]
>
>> Isn't there a race between these two (still waiting for compilation to
>> be finished)?
>
> We always hold the nklock in both contexts.

But we do not always use atomic ops for manipulating status bits (but we do
in other cases where there is no need - different story). This may fix the
race:

diff --git a/ksrc/nucleus/intr.c b/ksrc/nucleus/intr.c
index d7a772f..af8ebeb 100644
--- a/ksrc/nucleus/intr.c
+++ b/ksrc/nucleus/intr.c
@@ -85,7 +85,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie);
 void xnintr_host_tick(struct xnsched *sched) /* Interrupts off. */
 {
-	__clrbits(sched->status, XNHTICK);
+	clrbits(sched->status, XNHTICK);
 	xnarch_relay_tick();
 }
@@ -105,11 +105,13 @@ void xnintr_clock_handler(void)
 	trace_mark(xn_nucleus, irq_enter, "irq %u", XNARCH_TIMER_IRQ);
 	trace_mark(xn_nucleus, tbase_tick, "base %s", nktbase.name);
 
+	xnlock_get(&nklock);
+
 	++sched->inesting;
 	__setbits(sched->status, XNINIRQ);
 
-	xnlock_get(&nklock);
 	xntimer_tick_aperiodic();
+
 	xnlock_put(&nklock);
 
 	xnstat_counter_inc(&nkclock.stat[xnsched_cpu(sched)].hits);
@@ -117,7 +119,7 @@ void xnintr_clock_handler(void)
 			       &nkclock.stat[xnsched_cpu(sched)].account, start);
 
 	if (--sched->inesting == 0) {
-		__clrbits(sched->status, XNINIRQ);
+		clrbits(sched->status, XNINIRQ);
 		xnpod_schedule();
 	}
 	/*
@@ -178,7 +180,7 @@ static void xnintr_shirq_handler(unsigned irq, void *cookie)
 	trace_mark(xn_nucleus, irq_enter, "irq %u", irq);
 
 	++sched->inesting;
-	__setbits(sched->status, XNINIRQ);
+	setbits(sched->status, XNINIRQ);
 
 	xnlock_get(&shirq->lock);
 	intr = shirq->handlers;
@@ -220,7 +222,7 @@ static void xnintr_shirq_handler(unsigned irq, void *cookie)
 	xnarch_end_irq(irq);
 
 	if (--sched->inesting == 0) {
-		__clrbits(sched->status, XNINIRQ);
+		clrbits(sched->status, XNINIRQ);
 		xnpod_schedule();
 	}
@@ -247,7 +249,7 @@ static void xnintr_edge_shirq_handler(unsigned irq, void *cookie)
 	trace_mark(xn_nucleus, irq_enter, "irq %u", irq);
 
 	++sched->inesting;
-	__setbits(sched->status, XNINIRQ);
+	setbits(sched->status, XNINIRQ);
 
 	xnlock_get(&shirq->lock);
 	intr = shirq->handlers;
@@ -303,7 +305,7 @@ static void xnintr_edge_shirq_handler(unsigned irq, void *cookie)
 	xnarch_end_irq(irq);
 
 	if (--sched->inesting == 0) {
-		__clrbits(sched->status, XNINIRQ);
+		clrbits(sched->status, XNINIRQ);
 		xnpod_schedule();
 	}
 	trace_mark(xn_nucleus, irq_exit, "irq %u", irq);
@@ -446,7 +448,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie)
 	trace_mark(xn_nucleus, irq_enter, "irq %u", irq);
 
 	++sched->inesting;
-	__setbits(sched->status, XNINIRQ);
+	setbits(sched->status, XNINIRQ);
 
 	xnlock_get(&xnirqs[irq].lock);
@@ -493,7 +495,7 @@ static void xnintr_irq_handler(unsigned irq, void *cookie)
 	xnarch_end_irq(irq);
 
 	if (--sched->inesting == 0) {
-		__clrbits(sched->status, XNINIRQ);
+		clrbits(sched->status, XNINIRQ);
 		xnpod_schedule();
 	}

Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 23:03, Jan Kiszka wrote:
> [...]
>
> But we do not always use atomic ops for manipulating status bits (but we
> do in other cases where there is no need - different story). This may
> fix the race:

Err, nonsense. As we manipulate xnsched::status also outside of nklock
protection, we must _always_ use atomic ops.

This screams for a cleanup: local-only bits like XNHTICK or XNINIRQ should
be pushed into a separate status word that can then be safely modified
non-atomically.

Jan
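What such a split could look like (a sketch of the idea only; the field
name lflags is an assumption, not the actual patch):

typedef struct xnsched {
	xnflags_t status;	/* cross-CPU bits (e.g. XNRESCHED):
				 * always modified with atomic
				 * setbits()/clrbits() */
	xnflags_t lflags;	/* CPU-local bits (XNHTICK, XNINIRQ,
				 * XNINTCK): plain __setbits()/__clrbits()
				 * from the owning CPU only */
	/* ... remaining members unchanged ... */
} xnsched_t;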
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 03.11.2010 23:11, Jan Kiszka wrote:
> [...]
>
> Err, nonsense. As we manipulate xnsched::status also outside of nklock
> protection, we must _always_ use atomic ops. This screams for a cleanup:
> local-only bits like XNHTICK or XNINIRQ should be pushed into a separate
> status word that can then be safely modified non-atomically.

Second try to fix and clean up the sched status bits. Anders, please test.

Jan

diff --git a/include/nucleus/pod.h b/include/nucleus/pod.h
index 01ff0a7..5987a1f 100644
--- a/include/nucleus/pod.h
+++ b/include/nucleus/pod.h
@@ -277,12 +277,10 @@ static inline void xnpod_schedule(void)
 	 * context is active, or if we are caught in the middle of a
 	 * unlocked context switch.
 	 */
-#if XENO_DEBUG(NUCLEUS)
 	if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
 		return;
-#else /* !XENO_DEBUG(NUCLEUS) */
-	if (testbits(sched->status,
-		     XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
+#if !XENO_DEBUG(NUCLEUS)
+	if (!sched->resched)
 		return;
 #endif /* !XENO_DEBUG(NUCLEUS) */
diff --git a/include/nucleus/sched.h b/include/nucleus/sched.h
index df56417..1850208 100644
--- a/include/nucleus/sched.h
+++ b/include/nucleus/sched.h
@@ -44,7 +44,6 @@
 #define XNINTCK		0x1000	/* In master tick handler context */
 #define XNINIRQ		0x0800	/* In IRQ handling context */
 #define XNSWLOCK	0x0400	/* In context switch */
-#define XNRESCHED	0x0200	/* Needs rescheduling */
 #define XNHDEFER	0x0100	/* Host tick deferred */
@@ -63,7 +62,8 @@ struct xnsched_rt {
 typedef struct xnsched {
 	xnflags_t status;		/*!< Scheduler specific status bitmask. */
 	int cpu;
 	struct xnthread *curr;		/*!< Current thread. */
-	xnarch_cpumask_t resched;	/*!< Mask of CPUs needing rescheduling. */
+	xnarch_cpumask_t remote_resched; /*!< Mask of CPUs needing rescheduling. */
+	int resched;			/*!< Rescheduling needed. */
 	struct xnsched_rt rt;		/*!< Context of built-in real-time class. */
 
 #ifdef CONFIG_XENO_OPT_SCHED_TP
@@ -164,30 +164,21 @@ struct xnsched_class {
 #define xnsched_cpu(__sched__)	({ (void)__sched__; 0; })
 #endif /* CONFIG_SMP */
 
-/* Test all resched flags from the given scheduler mask. */
-static inline int xnsched_resched_p(struct xnsched *sched)
-{
-	return testbits(sched->status, XNRESCHED);
-}
-
-static inline int xnsched_self_resched_p(struct xnsched *sched)
-{
-	return testbits(sched->status, XNRESCHED);
-}
-
 /* Set self resched flag for the given scheduler. */
 #define xnsched_set_self_resched(__sched__) do {		\
-	setbits((__sched__)->status, XNRESCHED);		\
+	(__sched__)->resched = 1;				\
 } while (0)
 
 /* Set specific resched flag into the local scheduler mask. */
 #define xnsched_set_resched(__sched__) do {			\
-	xnsched_t *current_sched = xnpod_current_sched();	\
-	setbits(current_sched->status, XNRESCHED);		\
-	if (current_sched != (__sched__)) {			\
-		xnarch_cpu_set(xnsched_cpu(__sched__), current_sched->resched); \
-		setbits((__sched__)->status, XNRESCHED);	\
-	}							\
+	xnsched_t *current_sched = xnpod_current_sched();	\
+	current_sched->resched = 1;				\
+	if (current_sched != (__sched__)) {			\
+		xnarch_cpu_set(xnsched_cpu(__sched__),		\
+			       current_sched->remote_resched);	\
+		(__sched__)->resched = 1;			\
+		xnarch_memory_barrier();			\
+	}							\
 } while (0)
 
 void xnsched_zombie_hooks(struct xnthread *thread);
@@ -209,7 +200,7 @@ struct xnsched *xnsched_finish_unlocked_switch(struct xnsched *sched);
 static inline int
 xnsched_maybe_resched_after_unlocked_switch(struct xnsched *sched)
 {
-	return testbits(sched->status, XNRESCHED);
+	return sched->resched;
 }
 
 #else /* !CONFIG_XENO_HW_UNLOCKED_SWITCH */
diff --git a/ksrc/nucleus/pod.c b/ksrc/nucleus/pod.c
index 9e135f3..f7f8b2c 100644
--- a/ksrc/nucleus/pod.c
+++ b/ksrc/nucleus/pod.c
@@ -284,7 +284,7 @@ void xnpod_schedule_handler(void) /* Called with hw interrupts off. */
 	trace_xn_nucleus_sched_remote(sched);
 #if defined(CONFIG_SMP) && defined(CONFIG_XENO_OPT_PRIOCPL)
 	if [...]
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> [...]
>
> -#if XENO_DEBUG(NUCLEUS)
>  	if (testbits(sched->status, XNKCOUT|XNINIRQ|XNSWLOCK))
>  		return;
> -#else /* !XENO_DEBUG(NUCLEUS) */
> -	if (testbits(sched->status,
> -		     XNKCOUT|XNINIRQ|XNSWLOCK|XNRESCHED) != XNRESCHED)
> +#if !XENO_DEBUG(NUCLEUS)
> +	if (!sched->resched)
>  		return;
>  #endif /* !XENO_DEBUG(NUCLEUS) */

Having only one test was really nice here; maybe we simply need a barrier
before reading the status?

-- 
Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 00:11, Gilles Chanteperdrix wrote:
> [...]
>
> Having only one test was really nice here; maybe we simply need a barrier
> before reading the status?

I agree - but the alternative is letting all modifications of
xnsched::status use atomic bitops (that's required when folding all bits
into a single word). And that should be much more costly, specifically on
SMP.

Jan
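The cost difference being weighed, in the helpers used throughout this
thread (sketch; on x86 the atomic variants compile to lock-prefixed
read-modify-write instructions):

__setbits(sched->status, XNINIRQ);	/* plain RMW: cheap, but unsafe if
					 * another CPU can write the same
					 * word concurrently */
setbits(sched->status, XNINIRQ);	/* atomic RMW: safe, but the
					 * bus-locked access is paid on
					 * every IRQ entry/exit, on every
					 * CPU */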
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> I agree - but the alternative is letting all modifications of
> xnsched::status use atomic bitops (that's required when folding all bits
> into a single word). And that should be much more costly, specifically on
> SMP.

What about issuing a barrier before testing the status?

-- 
Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 00:18, Gilles Chanteperdrix wrote:
> What about issuing a barrier before testing the status?

The problem is not about reading but writing the status concurrently, thus
it's not about the code you see above.

Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> The problem is not about reading but writing the status concurrently,
> thus it's not about the code you see above.

The bits are modified under nklock, which implies a barrier when unlocked.
Furthermore, an IPI is guaranteed to be received on the remote CPU after
this barrier, so a barrier should be enough to see the modifications which
have been made remotely.

-- 
Gilles.
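Gilles' ordering argument, laid out as a sketch:

/*
 * CPU-b (under nklock):                  CPU-a (reschedule IPI handler):
 *
 * xnlock_get(&nklock);
 * setbits(sched_a->status, XNRESCHED);
 * xnlock_put(&nklock);    <- release: the store is visible before
 * xnarch_send_ipi(...);      anything that follows the unlock
 *                                        IPI arrives after the unlock, so
 *                                        the handler should observe
 *                                        XNRESCHED - unless another path
 *                                        writes status without nklock,
 *                                        which is Jan's objection below.
 */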
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 00:44, Gilles Chanteperdrix wrote:
> The bits are modified under nklock, which implies a barrier when
> unlocked. Furthermore, an IPI is guaranteed to be received on the remote
> CPU after this barrier, so a barrier should be enough to see the
> modifications which have been made remotely.

Check nucleus/intr.c for tons of unprotected status modifications.

Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> Check nucleus/intr.c for tons of unprotected status modifications.

Ok. Then maybe we should reconsider the original decision to start fiddling
with the XNRESCHED bit remotely.

-- 
Gilles.
Re: [Xenomai-core] Potential problem with rt_eepro100
Am 04.11.2010 00:56, Gilles Chanteperdrix wrote:
> Ok. Then maybe we should reconsider the original decision to start
> fiddling with the XNRESCHED bit remotely.

...which removed complexity and fixed a race? Let's better review the
checks done in xnpod_schedule vs. its callers; I bet there is more to save
(IOW: remove the need to test for sched->resched).

Jan
Re: [Xenomai-core] Potential problem with rt_eepro100
Jan Kiszka wrote:
> ...which removed complexity and fixed a race? Let's better review the
> checks done in xnpod_schedule vs. its callers; I bet there is more to
> save (IOW: remove the need to test for sched->resched).

Not that much complexity... and the race was a false positive in debug
code, no big deal. At least it worked, and it has done so for a long time.
No atomic needed, no barrier, only one test in xnpod_schedule. And a nice
invariant: sched->status is always accessed on the local cpu. What else?

-- 
Gilles.
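For comparison, the older CPU-local scheme Gilles is defending looked
roughly like this (paraphrased sketch, not a patch): the sender never
touches the remote status word, and the IPI handler raises the flag on its
own sched instead.

#define xnsched_set_resched(__sched__) do {				\
	xnsched_t *current_sched = xnpod_current_sched();		\
	setbits(current_sched->status, XNRESCHED);			\
	if (current_sched != (__sched__))				\
		xnarch_cpu_set(xnsched_cpu(__sched__),			\
			       current_sched->resched);			\
} while (0)

/* On the receiving CPU, the reschedule IPI handler: */
void xnpod_schedule_handler(void)
{
	xnsched_set_self_resched(xnpod_current_sched());
	xnpod_schedule();
}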