Am 03.11.2010 12:50, Jan Kiszka wrote:
> Am 03.11.2010 12:44, Anders Blomdell wrote:
>> Anders Blomdell wrote:
>>> Jan Kiszka wrote:
>>>> Am 01.11.2010 17:55, Anders Blomdell wrote:
>>>>> Jan Kiszka wrote:
>>>>>> Am 28.10.2010 11:34, Anders Blomdell wrote:
>>>>>>> Jan Kiszka wrote:
>>>>>>>> Am 28.10.2010 09:34, Anders Blomdell wrote:
>>>>>>>>> Anders Blomdell wrote:
>>>>>>>>>> Anders Blomdell wrote:
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I'm trying to use rt_eepro100 for sending raw ethernet packets,
>>>>>>>>>>> but I'm occasionally experiencing weird behaviour.
>>>>>>>>>>>
>>>>>>>>>>> Versions of things:
>>>>>>>>>>>
>>>>>>>>>>> linux-2.6.34.5
>>>>>>>>>>> xenomai-2.5.5.2
>>>>>>>>>>> rtnet-39f7fcf
>>>>>>>>>>>
>>>>>>>>>>> The test program runs on two computers with an "Intel Corporation
>>>>>>>>>>> 82557/8/9/0/1 Ethernet Pro 100 (rev 08)" controller, where one
>>>>>>>>>>> computer acts as a mirror, sending back packets received from the
>>>>>>>>>>> ethernet (only those two computers are on the network), and the
>>>>>>>>>>> other sends packets and measures roundtrip time. Most packets come
>>>>>>>>>>> back in approximately 100 us, but occasionally reception times out
>>>>>>>>>>> (once in about 100000 packets or more). The packet is then
>>>>>>>>>>> received immediately when reception is retried, which might
>>>>>>>>>>> indicate a race between rt_dev_recvmsg and the interrupt, but I
>>>>>>>>>>> might be missing something obvious.
>>>>>>>>>> Changing one of the ethernet cards to an "Intel Corporation 82541PI
>>>>>>>>>> Gigabit Ethernet Controller (rev 05)", while keeping everything
>>>>>>>>>> else constant, changes the behavior somewhat; after receiving a few
>>>>>>>>>> 100000 packets, reception stops entirely (-EAGAIN is returned),
>>>>>>>>>> while transmission proceeds as it should (and the mirror returns
>>>>>>>>>> the packets).
>>>>>>>>>>
>>>>>>>>>> Any suggestions on what to try?
>>>>>>>>> Since the problem disappears with 'maxcpus=1', I suspect I have
>>>>>>>>> an SMP issue (the machine is a Core2 Quad), so I'll move to
>>>>>>>>> xenomai-core. (The original message can be found at
>>>>>>>>> http://sourceforge.net/mailarchive/message.php?msg_name=4CC82C8D.3080808%40control.lth.se)
>>>>>>>>>
>>>>>>>>> Xenomai-core gurus: what is the correct way to debug SMP issues?
>>>>>>>>> Can I run the I-pipe tracer and expect to be able to save at
>>>>>>>>> least 150 us of traces for all CPUs? Any hints/suggestions/insights
>>>>>>>>> are welcome...
>>>>>>>> The i-pipe tracer unfortunately only saves traces for the CPU that
>>>>>>>> triggered the freeze. To have a full picture, you may want to try
>>>>>>>> the ftrace port I posted recently for 2.6.35.
>>>>>>> 2.6.35.7?
>>>>>> Exactly.
>>>>> Finally managed to get ftrace to work (one possible bug: I had to
>>>>> manually copy include/xenomai/trace/xn_nucleus.h to
>>>>> include/xenomai/trace/events/xn_nucleus.h), and it looks like it can
>>>>> be very useful...
>>>>>
>>>>> But I don't think it will give much info at the moment, since no
>>>>> xenomai/ipipe interrupt activity shows up, and adding that is far
>>>>> above my league :-(
>>>>
>>>> You could use the function tracer, provided you are able to stop the
>>>> trace quickly enough on error.
>>>>
>>>>> My current theory is that the problem occurs when something like this
>>>>> takes place:
>>>>>
>>>>> CPU-i            CPU-j     CPU-k           CPU-l
>>>>>
>>>>> rt_dev_sendmsg
>>>>>                  xmit_irq
>>>>>                            rt_dev_recvmsg  recv_irq
>>>>
>>>> Can't follow. Who races here, and what will go wrong then?
>>> That's the good question. Find attached:
>>>
>>> 1. .config (so you can check for stupid mistakes)
>>> 2. console log
>>> 3. latest version of test program
>>> 4.
>>>    tail of ftrace dump
>>>
>>> These are the xenomai tasks running when the test program is active:
>>>
>>> CPU  PID    CLASS  PRI  TIMEOUT  TIMEBASE  STAT  NAME
>>> 0    0      idle   -1   -        master    R     ROOT/0
>>> 1    0      idle   -1   -        master    R     ROOT/1
>>> 2    0      idle   -1   -        master    R     ROOT/2
>>> 3    0      idle   -1   -        master    R     ROOT/3
>>> 0    0      rt     98   -        master    W     rtnet-stack
>>> 0    0      rt     0    -        master    W     rtnet-rtpc
>>> 0    29901  rt     50   -        master          raw_test
>>> 0    29906  rt     0    -        master    X     reporter
>>>
>>> The lines of interest from the trace are probably:
>>>
>>> [003] 2061.347855: xn_nucleus_thread_resume: thread=f9bf7b00 thread_name=rtnet-stack mask=2
>>> [003] 2061.347862: xn_nucleus_sched: status=2000000
>>> [000] 2061.347866: xn_nucleus_sched_remote: status=0
>>>
>>> since this is the only place where a packet gets delayed, and the only
>>> place in the trace where sched_remote reports status=0.
>> Since the CPU that hosts rtnet-stack, and hence should be resumed, is
>> doing heavy I/O at the time of the fault: could it be that
>> send_ipi/schedule_handler needs barriers to make sure that decisions
>> are made on the right status?
>
> That was my first idea as well - but we should run all relevant code
> under nklock here. But please correct me if I miss something.
Mmmh -- not everything. The inlined XNRESCHED entry test in xnpod_schedule
runs outside nklock. But doesn't releasing nklock imply a memory write
barrier? Let me meditate...

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

_______________________________________________
Xenomai-core mailing list
Xenomai-core@gna.org
https://mail.gna.org/listinfo/xenomai-core