Re: [Xenomai] kernel crash issues - cobalt mode on ARM A53 with 32bit

Philippe Gerum Fri, 26 May 2017 09:40:46 -0700

On 05/22/2017 10:28 AM, Jerry Huang wrote:
> 
>> -----Original Message-----
>> From: Philippe Gerum [mailto:[email protected]]
>> Sent: Monday, May 22, 2017 4:11 PM
>> To: Jerry Huang <[email protected]>; Jorge Ramirez <[email protected]>;
>> [email protected]
>> Subject: Re: kernel crash issues - cobalt mode on ARM A53 with 32bit
>>
>> On 05/22/2017 05:33 AM, Jerry Huang wrote:
>>> Hi, all guys,
>>> I want to make the e1000e work with cobalt mode on ARM A53 with 32bit,
>>> however, I encountered some critical issues.
>>>
>>> 1> I want to use PCIe MSI-X or MSI mode, but that does not work; only
>>> legacy INTx works on PCIe.
>>> Can anyone give some advice on how to make PCIe MSI-X/MSI interrupts work?
>>>
>>> 2> After modifying the e1000e driver to adapt it to the IPIPE interrupt
>>> mode with INTx, the e1000e NIC works well at first: I can ping other
>>> machines and log in to them. But when the interrupt count reaches 1000
>>> (that is, the NIC has raised 1000 interrupts), this issue is reported:
>>> [ 1577.539977] [Xenomai] xnintr_irq_handler: IRQ83 not handled. Disabling IRQ line
>>> # cat /proc/xenomai/irq
>>>   IRQ         CPU0
>>>    17:        6320         [timer/0]
>>>    26:           9         fsl-ifc
>>>    83:        1000         eth7
>>>  1033:           0         [sync]
>>>  1034:           0         [timer-ipi]
>>>  1035:           0         [reschedule]
>>>  1036:           0         [virtual]
>>>  1040:           0         [virtual]
>>>
>>> After that, the NIC no longer works and the board must be rebooted.
>>> Can anyone give some advice on how to remove the 1000-interrupt
>>> limitation?
>>
>> Check the interrupt handler in your driver: it does not return
>> RTDM_IRQ_HANDLED after successfully handling an IRQ.
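
The point above can be illustrated with a small model of unhandled-IRQ accounting. This is a simplified Python sketch, not Cobalt's code: `Status`, `IrqLine`, and `MAX_UNHANDLED` are hypothetical names, and the threshold is an assumption chosen to match the count of 1000 shown for eth7 in the /proc/xenomai/irq output above. It only shows why a handler that never reports success ends with the line being disabled, while one that reports success does not.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):
    """Stand-ins for the handler return codes (not the real RTDM values)."""
    NONE = "none"        # plays the role of "IRQ was not handled"
    HANDLED = "handled"  # plays the role of RTDM_IRQ_HANDLED


# Assumed threshold, picked to match the 1000 interrupts seen in the report.
MAX_UNHANDLED = 1000


@dataclass
class IrqLine:
    """Toy model of one interrupt line and its unhandled-IRQ counter."""
    unhandled: int = 0
    disabled: bool = False

    def dispatch(self, handler: Callable[[], Status]) -> None:
        if self.disabled:
            return  # a disabled line delivers nothing further
        if handler() is Status.NONE:
            self.unhandled += 1
            if self.unhandled >= MAX_UNHANDLED:
                self.disabled = True  # "IRQ not handled. Disabling IRQ line"
        else:
            self.unhandled = 0  # a handled IRQ resets the counter


def buggy_handler() -> Status:
    # Does the work but forgets to report success.
    return Status.NONE


def fixed_handler() -> Status:
    return Status.HANDLED


def run(handler: Callable[[], Status], count: int) -> IrqLine:
    """Deliver `count` interrupts to a fresh line using `handler`."""
    line = IrqLine()
    for _ in range(count):
        line.dispatch(handler)
    return line
```

In this model, `run(buggy_handler, 1000)` ends with the line disabled, whereas `run(fixed_handler, 5000)` never trips the threshold.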
>>
>>>
>>> 3> After booting Linux in Cobalt mode with the e1000e NIC, I don't set
>>> the IP address (I don't run the command "ifconfig eth7 xx.xx.xx.xx up"),
>>> which means I don't enable the NIC.
>>
>> A quick check at both the e1000e driver code and the backtrace dump below
>> reveals that the work queue handler that crashes starts running periodically
>> when the NIC is probed, regardless of whether an IP address is set.
>>
>>> After around 1 day, the kernel crashes as below. Can anyone give some
>>> advice on how to make the system stable?
>>>
>>
>> Around one day doing what? Idle, running Xenomai, running a common load?
>> Is this reproducible without enabling Cobalt and/or the pipeline?
>>
>>> [253287.272440] Unhandled fault: synchronous external abort (0x1210) at 0xf05cb600
>>> [253287.279740] pgd = 80203000
>>> [253287.282523] [f05cb600] *pgd=80000080207003, *pmd=ecb6b003, *pte=c00050400cb713
>>> [253287.289831] Internal error: : 1210 [#1] SMP ARM
>>> [253287.294437] Modules linked in: ipv6
>>> [253287.298011] CPU: 0 PID: 4 Comm: kworker/0:0 Not tainted 4.1.35-ipipe #1
>>> [253287.304699] Hardware name: Generic DT based system
>>> [253287.309571] Workqueue: events e1000e_systim_overflow_work
>>> [253287.315047] task: ed860e40 ti: ed878000 task.ti: ed878000
>>> [253287.320523] PC is at e1000e_cyclecounter_read+0x14/0x124
>>> [253287.325913] LR is at timecounter_read+0x14/0x8c
>>> [253287.330520] pc : [<808a3a74>]    lr : [<802bcad0>]    psr: 600d0013
>>> [253287.330520] sp : ed879e68  ip : 00000000  fp : ee7a31c0
>>> [253287.342157] r10: a014d0c8  r9 : 00000000  r8 : 00000000
>>> [253287.347457] r7 : a014c4c0  r6 : a014f0c4  r5 : ed879ef0  r4 : a014f0e0
>>> [253287.354059] r3 : f05cb600  r2 : 00000000  r1 : 00000000  r0 : a014f0c8
>>> [253287.360662] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
>>> [253287.368045] Control: 30c5383d  Table: eb174fc0  DAC: fffffffd
>>> [253287.373866] Process kworker/0:0 (pid: 4, stack limit = 0xed878228)
>>> [253287.380121] Stack: (0xed879e68 to 0xed87a000)
>>> [253287.384554] 9e60:                   7f03c874 03046c00 812c9400 eb184900 7f03c874 a014f0e0
>>> [253287.392808] 9e80: ed879ef0 a014f0c4 a014c4c0 00000000 00000000 00000000 ee7a31c0 802bcad0
>>> [253287.401061] 9ea0: a014f078 a014f0c4 a014c4c0 00000000 00000000 00000000 ee7a31c0 808ae078
>>> [253287.409315] 9ec0: 00000000 00000001 81311b94 81311b94 a014f078 ed829980 ee7a31c0 ee7a6e00
>>> [253287.417568] 9ee0: 00000000 00000000 ed829980 808ae1dc ed814000 ee7a31c0 a014f078 ed829980
>>> [253287.425821] 9f00: a014f078 8027a588 ee7a31c0 ee7a31d4 ed878000 ee7a31c0 ed829998 ee7a31d4
>>> [253287.434075] 9f20: ed878000 00000008 812803dc ed829980 ee7a31c0 8027a8a0 8117c140 ee7a3324
>>> [253287.442328] 9f40: 8027a854 00000000 ed82d000 ed829980 8027a854 00000000 00000000 00000000
>>> [253287.450581] 9f60: 00000000 8027f700 8f0141c7 00000000 382a8206 ed829980 00000000 00000000
>>> [253287.458834] 9f80: ed879f80 ed879f80 00000000 00000000 ed879f90 ed879f90 ed879fac ed82d000
>>> [253287.467087] 9fa0: 8027f624 00000000 00000000 80222f54 00000000 00000000 00000000 00000000
>>> [253287.475340] 9fc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
>>> [253287.483593] 9fe0: 00000000 00000000 00000000 00000000 00000013 00000000 6822c08a 2600680a
>>> [253287.491851] [<808a3a74>] (e1000e_cyclecounter_read) from [<802bcad0>] (timecounter_read+0x14/0x8c)
>>> [253287.500889] [<802bcad0>] (timecounter_read) from [<808ae078>] (e1000e_phc_gettime+0x34/0x6c)
>>> [253287.509403] [<808ae078>] (e1000e_phc_gettime) from [<808ae1dc>] (e1000e_systim_overflow_work+0x1c/0x44)
>>> [253287.518875] [<808ae1dc>] (e1000e_systim_overflow_work) from [<8027a588>] (process_one_work+0x12c/0x3f8)
>>> [253287.528347] [<8027a588>] (process_one_work) from [<8027a8a0>] (worker_thread+0x4c/0x530)
>>> [253287.536515] [<8027a8a0>] (worker_thread) from [<8027f700>] (kthread+0xdc/0xf4)
>>> [253287.543816] [<8027f700>] (kthread) from [<80222f54>] (ret_from_fork+0x18/0x24)
>>> [253287.551115] Code: e240aa02 e24dd014 e51a37e0 e2833cb6 (e5936000)
>>> [253287.557286] ---[ end trace 795e386dc7b45ae9 ]---
>>> [253287.562873] Unable to handle kernel paging request at virtual address ffffffec
>>>
>>
>> In the message above, you have all the information you need to start digging
>> that issue. The "Unhandled fault" message is sent from a single place in the
>> ARM kernel, i.e. do_DataAbort(), so this should ring a bell about the reason
>> for that fault.
>>
>> Since that fault is synchronous, you also know that the PC value reported in
>> the message must be the address of the faulting instruction living in
>> e1000e_cyclecounter_read(). Disassembling the vmlinux image will give you
>> the exact instruction from the offset mentioned from the beginning of that
>> routine.
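
As a quick cross-check before disassembling vmlinux, the oops itself already carries the faulting instruction word: it is the parenthesised value on the "Code:" line. The Python sketch below extracts it, decodes it under the assumption that it is an ARM (A32) load/store with a plain immediate offset, and derives the start address of the routine from the reported PC and offset. The helper names are hypothetical and the decoder covers only that one encoding form; the disassembly of the actual vmlinux remains the authoritative source.

```python
import re

# Values copied from the oops above.
CODE_LINE = "Code: e240aa02 e24dd014 e51a37e0 e2833cb6 (e5936000)"
PC = 0x808A3A74      # from "pc : [<808a3a74>]"
PC_OFFSET = 0x14     # from "e1000e_cyclecounter_read+0x14/0x124"


def faulting_word(code_line: str) -> int:
    """Return the parenthesised instruction word from an oops 'Code:' line."""
    match = re.search(r"\(([0-9a-fA-F]{8})\)", code_line)
    if match is None:
        raise ValueError("no parenthesised instruction word found")
    return int(match.group(1), 16)


def decode_ldr_str_imm(word: int) -> str:
    """Decode an A32 LDR/STR with a 12-bit immediate offset.

    Only the plain offset form is handled (bits 27:25 == 0b010, P=1, W=0);
    anything else raises ValueError. The condition field is ignored.
    """
    if (word >> 25) & 0b111 != 0b010:
        raise ValueError("not an immediate-offset LDR/STR encoding")
    pre_indexed = (word >> 24) & 1
    add = (word >> 23) & 1
    byte = (word >> 22) & 1
    writeback = (word >> 21) & 1
    load = (word >> 20) & 1
    if not pre_indexed or writeback:
        raise ValueError("only the plain offset addressing form is handled")
    rn = (word >> 16) & 0xF   # base register
    rd = (word >> 12) & 0xF   # destination (load) or source (store) register
    imm = word & 0xFFF
    mnemonic = ("ldr" if load else "str") + ("b" if byte else "")
    if imm == 0:
        return f"{mnemonic} r{rd}, [r{rn}]"
    sign = "" if add else "-"
    return f"{mnemonic} r{rd}, [r{rn}, #{sign}{imm}]"


# Start of e1000e_cyclecounter_read, i.e. where to begin reading the disassembly.
function_start = PC - PC_OFFSET

if __name__ == "__main__":
    print(hex(function_start))
    print(decode_ldr_str_imm(faulting_word(CODE_LINE)))
```

Under those assumptions the faulting word decodes to a load through r3, and the register dump shows r3 holding f05cb600, the same address reported in the "Unhandled fault" line. That is consistent with the abort being raised by a read at that address, although this reading should be confirmed against the real disassembly.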
>>
>> From that point, you need to deduce the most probable cause by yourself,
>> trying different configurations such as disabling PTP, to make sure the issue
>> does not reappear elsewhere, showing some randomness, which would
>> reveal a deeper problem.
>>
>> For my part, I don't see any way to answer a question such as "how to make
>> the system stable", except maybe debugging it.
>>
>> --
> Thanks, Philippe.
> I added RTDM_IRQ_HANDLED to the IRQ handler, and the 1000-IRQ issue is gone.
> I will redo the stability test.
> For my test, I don't do anything; I just let the kernel idle after Linux starts up.
> 
> BTW, can we make MSI-X or MSI work for PCIe? If so, how?
> So far I can only make INTx work for PCIe.
>


The driver has to configure the device to use message-signaled IRQs, and
CONFIG_PCI_MSI is required.

-- 
Philippe.

_______________________________________________
Xenomai mailing list
[email protected]
https://xenomai.org/mailman/listinfo/xenomai
