Re: [Xenomai] kernel crash issues - cobalt mode on ARM A53 with 32bit

Jerry Huang Mon, 22 May 2017 01:29:33 -0700

> -----Original Message-----
> From: Philippe Gerum [mailto:[email protected]]
> Sent: Monday, May 22, 2017 4:11 PM
> To: Jerry Huang <[email protected]>; Jorge Ramirez <[email protected]>;
> [email protected]
> Subject: Re: kernel crash issues - cobalt mode on ARM A53 with 32bit
> 
> On 05/22/2017 05:33 AM, Jerry Huang wrote:
> > Hi all,
> > I am trying to get the e1000e driver working in Cobalt mode on an ARM
> > A53 running 32-bit, but I have run into some critical issues.
> >
> > 1> I want to use PCIe MSI-X or MSI mode, but neither works; only legacy
> > INTx works on PCIe.
> > Can anyone advise how to make PCIe MSI-X/MSI interrupts work?
> >
> > 2> After modifying the e1000e driver to use the I-pipe interrupt mode
> > with INTx, the e1000e NIC works well at first: I can ping and log in to
> > other machines. But when the interrupt count reaches 1000 (that is,
> > the NIC has raised 1000 interrupts), this issue is reported:
> > [ 1577.539977] [Xenomai] xnintr_irq_handler: IRQ83 not handled. Disabling IRQ line
> > # cat /proc/xenomai/irq
> >   IRQ         CPU0
> >    17:        6320         [timer/0]
> >    26:           9         fsl-ifc
> >    83:        1000         eth7
> >  1033:           0         [sync]
> >  1034:           0         [timer-ipi]
> >  1035:           0         [reschedule]
> >  1036:           0         [virtual]
> >  1040:           0         [virtual]
> >
> > After that, the NIC stops working and the board must be rebooted.
> > Can anyone advise how to remove the 1000-interrupt limitation?
> 
> Check the interrupt handler in your driver: it does not return
> RTDM_IRQ_HANDLED upon successfully handling an IRQ.
> 
> >
> > 3> After booting Linux in Cobalt mode with the e1000e NIC, I do not set
> > an IP address (I do not run "ifconfig eth7 xx.xx.xx.xx up"), which
> > means I do not enable the NIC.
> 
> A quick check at both the e1000e driver code and the backtrace dump below
> reveals that the work queue handler that crashes starts running periodically
> when the NIC is probed, regardless of whether an IP address is set.
> 
> > After around 1 day, the kernel crashes as below. Can anyone advise
> > how to make the system stable?
> >
> 
> Around one day doing what? Idle, running Xenomai, running a common load?
> Is this reproducible without enabling Cobalt and/or the pipeline?
> 
> > [253287.272440] Unhandled fault: synchronous external abort (0x1210) at 0xf05cb600
> > [253287.279740] pgd = 80203000
> > [253287.282523] [f05cb600] *pgd=80000080207003, *pmd=ecb6b003, *pte=c00050400cb713
> > [253287.289831] Internal error: : 1210 [#1] SMP ARM
> > [253287.294437] Modules linked in: ipv6
> > [253287.298011] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.1.35-ipipe #1
> > [253287.304699] Hardware name: Generic DT based system
> > [253287.309571] Workqueue: events e1000e_systim_overflow_work
> > [253287.315047] task: ed860e40 ti: ed878000 task.ti: ed878000
> > [253287.320523] PC is at e1000e_cyclecounter_read+0x14/0x124
> > [253287.325913] LR is at timecounter_read+0x14/0x8c
> > [253287.330520] pc : [<808a3a74>]    lr : [<802bcad0>]    psr: 600d0013
> > [253287.330520] sp : ed879e68  ip : 00000000  fp : ee7a31c0
> > [253287.342157] r10: a014d0c8  r9 : 00000000  r8 : 00000000
> > [253287.347457] r7 : a014c4c0  r6 : a014f0c4  r5 : ed879ef0  r4 : a014f0e0
> > [253287.354059] r3 : f05cb600  r2 : 00000000  r1 : 00000000  r0 : a014f0c8
> > [253287.360662] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
> > [253287.368045] Control: 30c5383d  Table: eb174fc0  DAC: fffffffd
> > [253287.373866] Process kworker/0:0 (pid: 4, stack limit = 0xed878228)
> > [253287.380121] Stack: (0xed879e68 to 0xed87a000)
> > [253287.384554] 9e60:                   7f03c874 03046c00 812c9400 eb184900 7f03c874 a014f0e0
> > [253287.392808] 9e80: ed879ef0 a014f0c4 a014c4c0 00000000 00000000 00000000 ee7a31c0 802bcad0
> > [253287.401061] 9ea0: a014f078 a014f0c4 a014c4c0 00000000 00000000 00000000 ee7a31c0 808ae078
> > [253287.409315] 9ec0: 00000000 00000001 81311b94 81311b94 a014f078 ed829980 ee7a31c0 ee7a6e00
> > [253287.417568] 9ee0: 00000000 00000000 ed829980 808ae1dc ed814000 ee7a31c0 a014f078 ed829980
> > [253287.425821] 9f00: a014f078 8027a588 ee7a31c0 ee7a31d4 ed878000 ee7a31c0 ed829998 ee7a31d4
> > [253287.434075] 9f20: ed878000 00000008 812803dc ed829980 ee7a31c0 8027a8a0 8117c140 ee7a3324
> > [253287.442328] 9f40: 8027a854 00000000 ed82d000 ed829980 8027a854 00000000 00000000 00000000
> > [253287.450581] 9f60: 00000000 8027f700 8f0141c7 00000000 382a8206 ed829980 00000000 00000000
> > [253287.458834] 9f80: ed879f80 ed879f80 00000000 00000000 ed879f90 ed879f90 ed879fac ed82d000
> > [253287.467087] 9fa0: 8027f624 00000000 00000000 80222f54 00000000 00000000 00000000 00000000
> > [253287.475340] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
> > [253287.483593] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 6822c08a 2600680a
> > [253287.491851] [<808a3a74>] (e1000e_cyclecounter_read) from [<802bcad0>] (timecounter_read+0x14/0x8c)
> > [253287.500889] [<802bcad0>] (timecounter_read) from [<808ae078>] (e1000e_phc_gettime+0x34/0x6c)
> > [253287.509403] [<808ae078>] (e1000e_phc_gettime) from [<808ae1dc>] (e1000e_systim_overflow_work+0x1c/0x44)
> > [253287.518875] [<808ae1dc>] (e1000e_systim_overflow_work) from [<8027a588>] (process_one_work+0x12c/0x3f8)
> > [253287.528347] [<8027a588>] (process_one_work) from [<8027a8a0>] (worker_thread+0x4c/0x530)
> > [253287.536515] [<8027a8a0>] (worker_thread) from [<8027f700>] (kthread+0xdc/0xf4)
> > [253287.543816] [<8027f700>] (kthread) from [<80222f54>] (ret_from_fork+0x18/0x24)
> > [253287.551115] Code: e240aa02 e24dd014 e51a37e0 e2833cb6 (e5936000)
> > [253287.557286] ---[ end trace 795e386dc7b45ae9 ]---
> > [253287.562873] Unable to handle kernel paging request at virtual address ffffffec
> >
> 
> In the message above, you have all the information you need to start digging
> into that issue. The "Unhandled fault" message is sent from a single place in the
> ARM kernel, i.e. do_DataAbort(), so this should ring a bell about the reason
> for that fault.
> 
> Since that fault is synchronous, you also know that the PC value reported in
> the message must be the address of the faulting instruction living in
> e1000e_cyclecounter_read(). Disassembling the vmlinux image will give you
> the exact instruction from the offset mentioned from the beginning of that
> routine.
> 
> From that point, you need to deduce the most probable cause by yourself,
> trying different configurations such as disabling PTP, to make sure the issue
> does not reappear elsewhere, showing some randomness, which would
> reveal a deeper problem.
> 
> For my part, I don't see any way to answer a question such as "how to make
> the system stable", except maybe debugging it.
> 
> --
Thanks,  Philippe.
I added RTDM_IRQ_HANDLED to the IRQ handler, and the 1000-IRQ issue is gone.
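
For anyone hitting the same "IRQ83 not handled. Disabling IRQ line"
message, here is a minimal userland model of why the return value matters.
It is only a sketch: every name in it (MODEL_IRQ_NONE, MODEL_IRQ_HANDLED,
MODEL_MAX_UNHANDLED, model_deliver, model_run) is a hypothetical stand-in,
not the real RTDM or Cobalt definition, and the threshold of 1000 is
assumed from the count shown in /proc/xenomai/irq above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for this model; NOT the real RTDM/Cobalt values. */
enum model_irq_ret { MODEL_IRQ_NONE = 0, MODEL_IRQ_HANDLED = 1 };

/* Assumed threshold, taken from the count observed in the report. */
#define MODEL_MAX_UNHANDLED 1000

struct model_irq_line {
    unsigned int unhandled;  /* consecutive interrupts nobody claimed */
    bool disabled;           /* set once the threshold is reached */
};

typedef enum model_irq_ret (*model_handler_t)(void);

/*
 * Deliver one interrupt to the handler.
 * Returns true if the line is still enabled afterwards.
 */
static bool model_deliver(struct model_irq_line *line, model_handler_t handler)
{
    assert(line != NULL && handler != NULL);

    if (line->disabled)
        return false;

    if (handler() == MODEL_IRQ_NONE) {
        /* Nobody claimed the interrupt: count it, give up at the limit. */
        if (++line->unhandled == MODEL_MAX_UNHANDLED)
            line->disabled = true;
    } else {
        /* A claimed interrupt resets the counter. */
        line->unhandled = 0;
    }
    return !line->disabled;
}

/* Handler that does its work but forgets to report success. */
static enum model_irq_ret buggy_handler(void) { return MODEL_IRQ_NONE; }

/* Same handler, reporting that the interrupt was handled. */
static enum model_irq_ret fixed_handler(void) { return MODEL_IRQ_HANDLED; }

/* Try to deliver n interrupts; return how many got through. */
static unsigned int model_run(model_handler_t handler, unsigned int n)
{
    struct model_irq_line line = { 0, false };
    unsigned int delivered = 0;

    for (unsigned int i = 0; i < n; i++) {
        if (line.disabled)
            break;
        model_deliver(&line, handler);
        delivered++;
    }
    return delivered;
}
```

In this model, model_run(buggy_handler, 5000) stops after 1000 deliveries,
while model_run(fixed_handler, 5000) gets all 5000 through. In the real
driver the handler would return RTDM_IRQ_HANDLED, as Philippe pointed out.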
I will rerun the stability test.
In my test I do not run anything; I just leave the kernel idle after Linux boots.
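
Following the advice above about reading the oops, the arithmetic can be
checked by hand. The sketch below is a plain userland C helper, not kernel
code; the struct and function names (arm_ldst_imm, symbol_start,
decode_ldst_imm, ldst_effective_address) are made up for illustration. It
computes the start of e1000e_cyclecounter_read from the PC and the +0x14
offset, and decodes the faulting word in parentheses on the "Code:" line,
assuming it is an A32 load/store with an immediate offset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of an A32 "single data transfer, immediate offset" instruction. */
struct arm_ldst_imm {
    bool is_load;      /* L bit (20): 1 = LDR, 0 = STR */
    bool is_byte;      /* B bit (22): 1 = byte access, 0 = word access */
    bool add_offset;   /* U bit (23): 1 = base + imm, 0 = base - imm */
    bool pre_indexed;  /* P bit (24): 1 = offset applied before the access */
    unsigned int rn;   /* base register, bits 19:16 */
    unsigned int rd;   /* transferred register, bits 15:12 */
    uint32_t imm12;    /* immediate offset, bits 11:0 */
};

/* Start address of the symbol, given "pc" and the "+0x.." offset in the oops. */
static uint32_t symbol_start(uint32_t pc, uint32_t offset)
{
    return pc - offset;
}

/*
 * Decode an A32 load/store with immediate offset (bits 27:25 == 0b010).
 * Returns false for any other instruction class.
 */
static bool decode_ldst_imm(uint32_t insn, struct arm_ldst_imm *out)
{
    assert(out != NULL);

    if ((insn >> 28) == 0xFu)              /* unconditional space: not LDR/STR */
        return false;
    if (((insn >> 25) & 0x7u) != 0x2u)
        return false;

    out->pre_indexed = (insn >> 24) & 1u;
    out->add_offset  = (insn >> 23) & 1u;
    out->is_byte     = (insn >> 22) & 1u;
    out->is_load     = (insn >> 20) & 1u;
    out->rn          = (insn >> 16) & 0xFu;
    out->rd          = (insn >> 12) & 0xFu;
    out->imm12       = insn & 0xFFFu;
    return true;
}

/* Address accessed by a pre-indexed transfer, given the base register value. */
static uint32_t ldst_effective_address(const struct arm_ldst_imm *d,
                                       uint32_t rn_value)
{
    assert(d != NULL);
    return d->add_offset ? rn_value + d->imm12 : rn_value - d->imm12;
}
```

With pc = 0x808a3a74 and offset 0x14 the routine starts at 0x808a3a60.
The word 0xe5936000 decodes as a word load into r6 from [r3, #0], and the
oops shows r3 = f05cb600, which is exactly the address in the "Unhandled
fault" line. So the abort happens on a read through r3, which is consistent
with a device register access in e1000e_cyclecounter_read(); disassembling
vmlinux as suggested would confirm which register.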


BTW, can we make MSI-X or MSI work for PCIe? If so, how?
So far I have only been able to make INTx work for PCIe.
_______________________________________________
Xenomai mailing list
[email protected]
https://xenomai.org/mailman/listinfo/xenomai
