> -----Original Message-----
> From: Philippe Gerum [mailto:[email protected]]
> Sent: Monday, May 22, 2017 4:11 PM
> To: Jerry Huang <[email protected]>; Jorge Ramirez <[email protected]>;
> [email protected]
> Subject: Re: kernel crash issues - cobalt mode on ARM A53 with 32bit
>
> On 05/22/2017 05:33 AM, Jerry Huang wrote:
> > Hi, all guys,
> > I want to make the e1000e work with cobalt mode on ARM A53 with 32bit,
> > however, I encountered some critical issues.
> >
> > 1> I want to use PCIe MSIx or MSI mode, but that does not work, we need
> > to use legacy INTx on PCIe, that can work.
> > Anyone can give some advice how to make PCIe MSIx/MSI interrupt work?
> >
> > 2> After modifying the e1000e driver to adapt the IPIPE interrupt mode
> > with INTx, first, the e1000e NIC can work well, I can ping other machine, and
> > can login other machine. But when the interrupt number reaches 1000 (that
> > means there are 1000 interrupts on NIC), the issue is reported:
> > [ 1577.539977] [Xenomai] xnintr_irq_handler: IRQ83 not handled. Disabling IRQ line
> > # cat /proc/xenomai/irq
> >   IRQ         CPU0
> >    17:        6320         [timer/0]
> >    26:           9         fsl-ifc
> >    83:        1000         eth7
> >  1033:           0         [sync]
> >  1034:           0         [timer-ipi]
> >  1035:           0         [reschedule]
> >  1036:           0         [virtual]
> >  1040:           0         [virtual]
> >
> > Since that, the NIC can't work, must reboot the board.
> > Anyone can give some advice how to remove the interrupt number 1000
> > limitation?
>
> Check the interrupt handler in your driver, it does not return
> RTDM_IRQ_HANDLED upon success handling an IRQ.
>
> >
> > 3> after booting up the Linux with Cobalt mode and e1000e NIC, and I
> > don't' set the IP address (not use command "ifconfig eth7 xx.xx.xx.xx up"),
> > that means I don't enable the NIC card.
>
> A quick check at both the e1000e driver code and the backtrace dump below
> reveals that the work queue handler that crashes starts running periodically
> when the NIC is probed, regardless of whether an IP address is set.
>
> > After around 1 day, kernel crash as below, anyone can give some advice
> > how to make the system stable?
> >
>
> Around one day doing what? Idle, running Xenomai, running a common load?
> Is this reproducible without enabling Cobalt and/or the pipeline?
>
> > [253287.272440] Unhandled fault: synchronous external abort (0x1210) at 0xf05cb600
> > [253287.279740] pgd = 80203000
> > [253287.282523] [f05cb600] *pgd=80000080207003, *pmd=ecb6b003, *pte=c00050400cb713
> > [253287.289831] Internal error: : 1210 [#1] SMP ARM
> > [253287.294437] Modules linked in: ipv6
> > [253287.298011] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.1.35-ipipe #1
> > [253287.304699] Hardware name: Generic DT based system
> > [253287.309571] Workqueue: events e1000e_systim_overflow_work
> > [253287.315047] task: ed860e40 ti: ed878000 task.ti: ed878000
> > [253287.320523] PC is at e1000e_cyclecounter_read+0x14/0x124
> > [253287.325913] LR is at timecounter_read+0x14/0x8c
> > [253287.330520] pc : [<808a3a74>] lr : [<802bcad0>] psr: 600d0013
> > [253287.330520] sp : ed879e68 ip : 00000000 fp : ee7a31c0
> > [253287.342157] r10: a014d0c8 r9 : 00000000 r8 : 00000000
> > [253287.347457] r7 : a014c4c0 r6 : a014f0c4 r5 : ed879ef0 r4 : a014f0e0
> > [253287.354059] r3 : f05cb600 r2 : 00000000 r1 : 00000000 r0 : a014f0c8
> > [253287.360662] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment kernel
> > [253287.368045] Control: 30c5383d Table: eb174fc0 DAC: fffffffd
> > [253287.373866] Process kworker/0:0 (pid: 4, stack limit = 0xed878228)
> > [253287.380121] Stack: (0xed879e68 to 0xed87a000)
> > [253287.384554] 9e60:                   7f03c874 03046c00 812c9400 eb184900 7f03c874 a014f0e0
> > [253287.392808] 9e80: ed879ef0 a014f0c4 a014c4c0 00000000 00000000 00000000 ee7a31c0 802bcad0
> > [253287.401061] 9ea0: a014f078 a014f0c4 a014c4c0 00000000 00000000 00000000 ee7a31c0 808ae078
> > [253287.409315] 9ec0: 00000000 00000001 81311b94 81311b94 a014f078 ed829980 ee7a31c0 ee7a6e00
> > [253287.417568] 9ee0: 00000000 00000000 ed829980 808ae1dc ed814000 ee7a31c0 a014f078 ed829980
> > [253287.425821] 9f00: a014f078 8027a588 ee7a31c0 ee7a31d4 ed878000 ee7a31c0 ed829998 ee7a31d4
> > [253287.434075] 9f20: ed878000 00000008 812803dc ed829980 ee7a31c0 8027a8a0 8117c140 ee7a3324
> > [253287.442328] 9f40: 8027a854 00000000 ed82d000 ed829980 8027a854 00000000 00000000 00000000
> > [253287.450581] 9f60: 00000000 8027f700 8f0141c7 00000000 382a8206 ed829980 00000000 00000000
> > [253287.458834] 9f80: ed879f80 ed879f80 00000000 00000000 ed879f90 ed879f90 ed879fac ed82d000
> > [253287.467087] 9fa0: 8027f624 00000000 00000000 80222f54 00000000 00000000 00000000 00000000
> > [253287.475340] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [253287.483593] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 6822c08a 2600680a
> > [253287.491851] [<808a3a74>] (e1000e_cyclecounter_read) from [<802bcad0>] (timecounter_read+0x14/0x8c)
> > [253287.500889] [<802bcad0>] (timecounter_read) from [<808ae078>] (e1000e_phc_gettime+0x34/0x6c)
> > [253287.509403] [<808ae078>] (e1000e_phc_gettime) from [<808ae1dc>] (e1000e_systim_overflow_work+0x1c/0x44)
> > [253287.518875] [<808ae1dc>] (e1000e_systim_overflow_work) from [<8027a588>] (process_one_work+0x12c/0x3f8)
> > [253287.528347] [<8027a588>] (process_one_work) from [<8027a8a0>] (worker_thread+0x4c/0x530)
> > [253287.536515] [<8027a8a0>] (worker_thread) from [<8027f700>] (kthread+0xdc/0xf4)
> > [253287.543816] [<8027f700>] (kthread) from [<80222f54>] (ret_from_fork+0x18/0x24)
> > [253287.551115] Code: e240aa02 e24dd014 e51a37e0 e2833cb6 (e5936000)
> > [253287.557286] ---[ end trace 795e386dc7b45ae9 ]---
> > [253287.562873] Unable to handle kernel paging request at virtual address ffffffec
> >
>
> In the message above, you have all the information you need to start digging
> that issue. The "Unhandled fault" message is sent from a single place in the
> ARM kernel, i.e.
> do_DataAbort(), so this should ring a bell about the reason
> for that fault.
>
> Since that fault is synchronous, you also know that the PC value reported in
> the message must be the address of the faulting instruction living in
> e1000e_cyclecounter_read(). Disassembling the vmlinux image will give you
> the exact instruction from the offset mentioned from the beginning of that
> routine.
>
> From that point, you need to deduce the most probable cause by yourself,
> trying different configurations such as disabling PTP, to make sure the issue
> does not reappear elsewhere, showing some randomness, which would
> reveal a deeper problem.
>
> For my part, I don't see any way to answer a question such as "how to make
> the system stable", except maybe debugging it.
>
> --

Thanks, Philippe.
The IRQ handler now returns RTDM_IRQ_HANDLED, and the issue at 1000 interrupts is gone.
I will redo the stability test.
During that test I did not run anything: the kernel was left idle after Linux booted.
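
Before rerunning the test, the numbers in the oops quoted above can be cross-checked by hand, following the hint about the faulting instruction. The sketch below is only my own helper for that, not kernel or driver code: the names decode_ldr_imm, function_start and struct ldr_imm are hypothetical, and it only recognizes the plain A32 "LDR Rt, [Rn, #imm12]" offset form. Disassembling vmlinux remains the authoritative check.

```c
#include <stdint.h>

/* Decoded fields of an A32 "LDR Rt, [Rn, #+/-imm12]" instruction
 * (hypothetical helper type, for illustration only). */
struct ldr_imm {
    int valid;       /* 1 if the word is a word-sized LDR, offset addressing */
    unsigned rn;     /* base register number */
    unsigned rt;     /* destination register number */
    int32_t offset;  /* signed immediate offset applied to the base */
};

static struct ldr_imm decode_ldr_imm(uint32_t insn)
{
    struct ldr_imm d = {0, 0, 0, 0};

    /* Condition field 1111 selects a different instruction space. */
    if ((insn >> 28) == 0xfu)
        return d;

    /* bits 27..25 = 010 (immediate form), P = 1, B = 0 (word),
     * W = 0 (no writeback), L = 1 (load). */
    if ((insn & 0x0f700000u) != 0x05100000u)
        return d;

    d.valid = 1;
    d.rn = (insn >> 16) & 0xfu;
    d.rt = (insn >> 12) & 0xfu;
    d.offset = (int32_t)(insn & 0xfffu);
    if (!(insn & (1u << 23)))   /* U bit clear: offset is subtracted */
        d.offset = -d.offset;
    return d;
}

/* Start address of a function, from the oops PC and the "+0x.." offset. */
static uint32_t function_start(uint32_t pc, uint32_t offset_in_function)
{
    return pc - offset_in_function;
}
```

With the values from the dump, decode_ldr_imm(0xe5936000) gives rn = 3, rt = 6, offset = 0, i.e. "ldr r6, [r3]", and r3 holds f05cb600, which is the address reported by the "Unhandled fault" line. function_start(0x808a3a74, 0x14) gives 0x808a3a60 as the start of e1000e_cyclecounter_read(). So the abort appears to come from a read through r3 at that address.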

BTW, can MSI-X or MSI be made to work for PCIe with Cobalt? If so, how?
So far I can only get legacy INTx working for PCIe.

_______________________________________________
Xenomai mailing list
[email protected]
https://xenomai.org/mailman/listinfo/xenomai
